Patent mining: combining dictionary-based and machine-learning approaches
Authors
Abstract
Exploration of the chemical patent space is essential for early-stage medicinal chemistry activities. The BioCreative CHEMDNER-patents task focuses on the recognition of chemical compounds in patents. It includes recognition of chemical named entities in patents (CEMP), classification of chemical-related patent titles and abstracts (CPD), and recognition of genes and proteins in patent abstracts (GPRO). In this study we tackled the CEMP and CPD tasks. We investigated an ensemble system in which a dictionary-based approach is combined with a machine-learning approach to extract compounds from text. To this end, the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a CRF-based chemical recognizer. To improve the performance of tmChem, three additional feature types were introduced (POS tags, lemmas, and word-vector clusters). When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task and an accuracy of 91.53% for the CPD task. On the test set, our system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second for CPD with an accuracy of 94.23%.
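The abstract does not say how the dictionary-based and CRF-based outputs were merged, so the following is only a minimal sketch of one plausible ensemble rule, not the authors' method. It assumes both recognizers return character-offset spans, takes their union, and keeps the longer span when two overlap. The functions dictionary_spans, merge_spans, and f_score are hypothetical names for illustration and are not part of Peregrine or tmChem; the exact-match F-score is likewise a simplified stand-in for the official evaluation.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def dictionary_spans(text: str, terms: List[str]) -> Set[Span]:
    """Hypothetical dictionary lookup: case-insensitive exact matches."""
    spans: Set[Span] = set()
    lowered = text.lower()
    for term in terms:
        needle = term.lower()
        if not needle:
            continue  # skip empty dictionary entries
        start = lowered.find(needle)
        while start != -1:
            spans.add((start, start + len(needle)))
            start = lowered.find(needle, start + 1)
    return spans


def merge_spans(dict_spans: Set[Span], ml_spans: Set[Span]) -> List[Span]:
    """Union both outputs; on overlap keep the longer span (assumed rule)."""
    # Visit candidates from longest to shortest so long mentions win.
    candidates = sorted(dict_spans | ml_spans,
                        key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end in candidates:
        # Keep the candidate only if it overlaps no span already kept.
        if all(end <= ks or start >= ke for ks, ke in kept):
            kept.append((start, end))
    return sorted(kept)


def f_score(predicted: Iterable[Span], gold: Iterable[Span]) -> float:
    """Exact-match F-score: harmonic mean of precision and recall."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, a dictionary hit nested inside a longer machine-learning mention is dropped in favor of the longer mention, while dictionary hits that the machine-learning component missed are retained, which is one way a combination could raise recall.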
Similar resources
A Novel Face Detection Method Based on Over-complete Incoherent Dictionary Learning
In this paper, face detection problem is considered using the concepts of compressive sensing technique. This technique includes dictionary learning procedure and sparse coding method to represent the structural content of input images. In the proposed method, dictionaries are learned in such a way that the trained models have the least degree of coherence to each other. The novelty of the prop...
Sports Result Prediction Based on Machine Learning and Computational Intelligence Approaches: A Survey
In the current world, sports produce considerable statistical information about each player, team, games, and seasons. Traditional sports science believed science to be owned by experts, coaches, team managers, and analyzers. However, sports organizations have recently realized the abundant science available in their data and sought to take advantage of that science through the use of data mini...
Identification of Chemical Entities in Patent Documents
Biomedical literature is an important source of information for chemical compounds. However, different representations and nomenclatures for chemical entities exist, which makes the reference of chemical entities ambiguous. Many systems already exist for gene and protein entity recognition, however very few exist for chemical entities. The main reason for this is the lack of corpus to train nam...
Combining Machine Learning with Dictionary Lookup for Chemical Compound and Drug Name Recognition Task
Following the interest in Named Entity Recognition in the academic literature, as seen in the Gene Mention recognition task of BioCreative I and II, BioCreative IV aims to encourage systems that detect mentions of chemical compounds and drugs. Considering that machine learning methods have obtained great success in the correct identification of gene and prot...
Twitter Sentiment Analysis: Lexicon Method, Machine Learning Method and Their Combination
This paper presents a step-by-step methodology for Twitter sentiment analysis. Two approaches are tested to measure variations in public opinion about retail brands. The first, a lexicon-based method, uses a dictionary of words with assigned semantic scores to calculate the final polarity of a tweet, and incorporates part-of-speech tagging. The second, a machine learning approach, tackl...
Publication date: 2015